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Abstract Overview 


• As the volume of information available to the public increases exponentially, it is crucial that data storage, management, classification, ranking, and reporting techniques improve as well.

• The purpose of this paper is to discuss how search engines work and what modifications can be made to make them faster and more accurate.

• To do this, we must consider the properties that make the search domain (the web) difficult to mine.

• Finally, we want to ensure that the optimizations we introduce are scalable, affordable, maintainable, and reasonable to implement.

Background - Section I - Outline

• Larry Page and Sergey Brin
• Their Main Ideas
• Networking Background
• Mathematical Background
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Larry Page and Sergey Brin 


Brin, a native of Moscow, received a 
B.S. degree with honors in mathematics 
and CS from the University of Maryland 
at College Park. During his graduate 
program at Stanford, Sergey met Larry 
Page and worked on the project that 
became Google. 


Larry Page was Google's founding CEO 
and grew the company to more than 200 
employees and profitability before 
moving into his role as president of 
products in April 2001. 


Main Idea 
This presentation is based on the research paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", written by Larry Page and Sergey Brin at Stanford University.

"The Anatomy of a Large-Scale Hypertextual Web Search Engine"


The paper by Larry Page and Sergey Brin focuses mainly on: 
• Design Goals of the Google Search Engine
• The Infrastructure of Search Engines
• Crawling, Indexing, and Searching the Web
• Link Analysis and the PageRank Algorithm
• Results and Performance
• Future Work


These topics will be mentioned in this presentation. 


mek (UVM) Search Engines April 15, 2009 5/57 


Networking Background 




• This presentation assumes that you know the difference between the Web and the Internet.

• It also assumes familiarity with TCP and UDP connections, sockets, and ports.

• How web servers are configured to be publicly viewable.

• How CGI (the Common Gateway Interface) is used to transmit data via POST and GET.

I planned on leaving myself extra time, so if anyone is unfamiliar with one of the areas above, I encourage you to ask away!
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Mathematical Background 


The PageRank Algorithm requires previous knowledge of many key topics in 
Linear Algebra, such as: 


• Matrix Addition and Subtraction
• Eigenvectors and Eigenvalues
• Power Iterations
• Dot Products and Cross Products

I will try to avoid linear algebra when possible, but some of these components are crucial to understanding PageRank.
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Introduction - Section II - Outline 


• Terms and Definitions
• Brief Search Engine History
• How Search Engines Work
• Search Engine Design Goals
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Terms and Definitions 


Hypertext - interactive text that is encoded with a destination address. When a hypertext link is activated (usually by a mouse click, forward arrow, or Enter key), the user is redirected to the link's specified destination.


Hypertextual Search Engine 


A hypertextual search engine is a system in which inquiries for data are parsed into sets of rules, criteria, and constraints, and then algorithmically applied to a population of data in order to isolate the data candidates (search results) that best fit the original search criteria. These results are then "reported", or displayed back to the user, in hypertextual format.
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Terms and Definitions, Cont'd 


SERP - Search Engine Results Page 


SERP is an acronym for Search Engine Results Page. It can refer either to the physical look and feel of the results page generated by the engine or to the process by which search engine results are categorized.


SEO - Search Engine Optimization


SEO stands for Search Engine Optimization. SEO can relate either to search accuracy or to search speed (efficiency).




Brief Search Engine History 


In 1990, Alan Emtage, Bill Heelan, and Peter Deutsch developed an engine called Archie for indexing FTP (File Transfer Protocol) archives ("The First Search Engine, Archie", Wei Li).

While Archie was a crucial first step in creating a directory listing of FTP sites, it did not index the contents of these sites. Archie was followed by two search engines built on the Gopher protocol (port 70): Veronica (1992) and Jughead (1993).

JumpStation (December 1993) used a web crawling agent for both searching and indexing web pages. It provided a hypertextual web form as an interface for queries. JumpStation is considered the "first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching)" (Wikipedia).

In 1994, Brian Pinkerton (University of Washington) released WebCrawler, the first search engine that provided searching of entire document content.
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How Search Engines Work 


[Figure: the life of a Google query, passing between the Google user, the web server, the index servers, and the doc servers]

1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match any particular query term.

2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.

3. The search results are returned to the user in a fraction of a second.


• First, the user inputs a query for data. The search is submitted to a back-end server via CGI (the Common Gateway Interface). Google uses GET, as opposed to POST, for reasons that will be explained...
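
As a rough illustration of what submitting a search via GET looks like, the sketch below encodes the search terms into the URL itself and then recovers them on the receiving side. The host name `search.example` and the parameter name `q` are assumptions made for this example, not details taken from the slides. One property worth noticing is that the whole query lives in the URL, so a results page can be bookmarked or linked.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(terms, base="http://search.example/search"):
    """Encode the search terms into a GET URL (base and "q" are hypothetical)."""
    return base + "?" + urlencode({"q": terms})


def read_query(url):
    """Recover the search terms from the URL, as a CGI script would."""
    return parse_qs(urlsplit(url).query)["q"][0]


url = build_search_url("anatomy of a search engine")
print(url)
print(read_query(url))
```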


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How Search Engines Work, Cont'd 


• The server uses regex (regular expressions) to parse the user's inquiry for data. The submitted strings can be permuted and rearranged to test for spelling errors and to find pages containing closely related content. (Specifics on Google's querying will be shown later.)

• The search engine searches its database for documents that closely relate to the user's input.

• In order to generate meaningful results, the search engine uses a variety of algorithms that work together to describe the relative importance of any specific search result.

• Finally, the engine returns the results to the user.
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Google Query Evaluation 


1. Query is parsed.
2. Words are converted into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
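
The eight steps above can be sketched in a few lines. This is a simplified illustration rather than Google's implementation: the barrels are modeled as plain dictionaries from wordID to a list of docIDs, matching is done with set intersection instead of seeking through sorted doclists on disk, and `rank` is a hypothetical scoring callback supplied by the caller.

```python
import heapq


def matching_docs(doclists):
    """Return the docIDs that appear in every doclist (step 4)."""
    if not doclists:
        return []
    common = set(doclists[0])
    for doclist in doclists[1:]:
        common &= set(doclist)
    return sorted(common)


def evaluate(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                        # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []                                        # unknown term: nothing matches
    word_ids = [lexicon[w] for w in words]               # step 2: words -> wordIDs
    scores = {}
    for barrel in (short_barrel, full_barrel):           # steps 3 and 6
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc_id in matching_docs(doclists):           # steps 4 and 7
            if doc_id not in scores:
                scores[doc_id] = rank(doc_id, word_ids)  # step 5
    return heapq.nlargest(k, scores, key=scores.get)     # step 8


# Toy data, invented for the example.
lexicon = {"web": 1, "search": 2}
short_barrel = {1: [1, 3], 2: [3]}
full_barrel = {1: [1, 2, 3], 2: [2, 3, 4]}
toy_rank = {2: 0.5, 3: 0.9}
results = evaluate("web search", lexicon, short_barrel, full_barrel,
                   lambda doc_id, ids: toy_rank[doc_id])
print(results)
```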
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Single Word Query Ranking 


• The hitlist is retrieved for the single word.
• Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
• Each hit type is assigned its own weight.
• The type-weights make up a vector of weights.
• The number of hits of each type is counted to form a count-weight vector.
• The dot product of the type-weight and count-weight vectors is used to compute the IR score.
• The IR score is combined with PageRank to compute the final rank.
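
A minimal sketch of this scoring scheme follows. The numeric type-weights, the capped count-weight function, and the linear mix of IR score and PageRank in `final_rank` are all assumptions made for illustration; the slides do not give the real values or the real combining function.

```python
from collections import Counter

# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 10, "anchor": 8, "url": 6, "large_font": 3, "small_font": 1}


def count_weight(count, cap=8):
    """Stand-in count-weight: grows linearly with the count, flat after the cap."""
    return min(count, cap)


def ir_score(hitlist):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = Counter(hitlist)
    return sum(weight * count_weight(counts[hit_type])
               for hit_type, weight in TYPE_WEIGHTS.items())


def final_rank(ir, pagerank, alpha=0.5):
    """Hypothetical combination: a weighted average of IR score and PageRank."""
    return alpha * ir + (1 - alpha) * pagerank


hits = ["title", "anchor", "anchor", "small_font"]
score = ir_score(hits)
print(score, final_rank(score, 0.5))
```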
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Multi-Word Query Ranking 


• Similar to single-word ranking, except that the proximity of the words in a document must now be analyzed.
• Hits occurring closer together are weighted higher than those farther apart.
• Each proximity relation is classified into 1 of 10 bins, ranging from a "phrase match" to "not even close".
• Each type and proximity pair has a type-prox weight.
• Counts are converted into count-weights.
• The dot product of the count-weights and the type-prox weights is used to compute the IR score.
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Search Engine Design Goals 


• Scalability with web growth
• Improved search quality
  • Decrease the number of irrelevant results
  • Incorporate feedback systems to account for user approval
  • There are too many pages for people to view, so some heuristic must be used to rank sites' importance for users
• Improved search speed
  • Even as the domain space rapidly increases
  • Take into consideration the types of documents hosted
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Search Engine Infrastructure - Section III - Outline 


• URL Resolving and Web Crawling
• Indexing and Searching
• Google's Infrastructural Model
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URL Resolving and Web Crawling 


Before a search engine can respond to user inquiries, it must first generate a database of URLs (Uniform Resource Locators) which describe where web servers (and their files) are located. URLs, or web addresses, are pieces of data that specify the location of a file and the service that can be used to access it.


The URL Server's job is to keep track of the URLs that have been crawled and those that still need to be crawled. In order to obtain a current mapping of web servers and their file trees, Google's URL Server routinely invokes a series of web crawling agents called Googlebots. Web users can also manually request that their URLs be added to Google's URL Server.
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URL Resolving and Web Crawling 


Web Crawlers: When a web page is "crawled", it has effectively been downloaded. Googlebots are Google's web crawling agents (scripts written in Python), each of which keeps approximately 300 connections open in parallel to different well-connected servers in order to "build a searchable index for Google's search engine" (Wikipedia).


Brin and Page noted that DNS (Domain Name System) lookups were expensive, so they gave the crawling agents DNS caching abilities.


Googlebot is known as a well-behaved spider: a site can avoid being crawled by adding <meta name="Googlebot" content="nofollow" /> to the head of the document (or by adding a robots.txt file).
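
To make the robots.txt route concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents and the `example.com` URLs are invented for the example, and the sketch shows how any polite crawler could honor such a file rather than how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is kept out of /private/,
# every other crawler may fetch anything.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/private/a.html"))
print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/public/a.html"))
```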
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Indexing the Web involves three main things: 


• Parsing: parsers must handle typos in meta tags and missing data in transactions. Instead of using YACC to generate a CFG (context-free grammar) parser, Brin and Page used flex to perform lexical analysis.

• Indexing Documents into Barrels: after each document is parsed, every word is assigned a wordID. These word and wordID pairs are used to construct an in-memory hash table (the lexicon).

  • Barrelling Bottleneck: the indexing phase is difficult to parallelize because the lexicon needs to be shared. They solved this by fixing a base lexicon of 14 million words and writing every word not found in it to a log, which removed the need for a shared lexicon and allowed multiple indexers to run in parallel. The log of extra words is then indexed separately.

• Sorting: the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage.
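
The sorting step can be sketched as turning a forward barrel (entries grouped by docID) into an inverted barrel (entries grouped by wordID). The in-memory layout below, with tuples and lists of hit positions, is an assumption chosen for readability; the real barrels are compact on-disk structures.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Re-key a forward barrel by wordID.

    forward_barrel: list of (docID, [(wordID, hits), ...]) sorted by docID.
    Returns a dict wordID -> [(docID, hits), ...], with wordIDs in sorted order
    and each doclist still ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))


# Toy forward barrel: hits are word positions within the document.
forward = [
    (1, [(7, [3]), (2, [0, 5])]),
    (2, [(7, [1])]),
]
inverted = invert(forward)
print(inverted)
```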
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Searching 


[Screenshot: Google results page for the query "the anatomy of a large-scale hypertextual web search engine". Results 1 - 10 of about 39,000 (0.13 seconds). The top result is "The Anatomy of a Large-Scale Hypertextual Web Search Engine" at infolab.stanford.edu/~backrub/google.html, followed by a PDF copy at infolab.stanford.edu/pub/papers/google.pdf (by S Brin, cited by 4915).]

The paper did not report any speed or efficiency problems with searching; instead, the authors focused on making searches more accurate. At the time the paper was written, Google queries returned 40,000 results. I got only approximately 39,000 results when I Google-searched their paper, but fetching them took only 0.13 seconds. (Not bad!)


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Google's Infrastructure Overview 


Google's architecture includes 14 major components: a URL Server, multiple Web Crawlers, a Store Server, a Hypertextual Document Repository, an Anchors database, a URL Resolver, a Hypertextual Document Indexer, a Lexicon, multiple short and long Barrels, a Sorter Service, a Searcher Service, and a PageRank Service. These systems were implemented in C and C++ on Linux and Solaris systems.

Infrastructure Part I

Google's Architecture

(from http://www.ics.uci.edu/~scott/google.htm)

• Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
• Store Server: compresses and stores web pages.
• URL Server: keeps track of URLs that have been crawled and URLs that still need to be crawled.
• Anchors: stores each link and the text surrounding the link.
• URL Resolver: converts relative URLs into absolute URLs.
• Indexer: uncompresses and parses documents. Stores link information in the anchors file.
• Repository: contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.

Infrastructure Part II

Google's Architecture

(from http://www.ics.uci.edu/~scott/google.htm)

• URL Resolver: maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in "barrels". Generates a database of links (pairs of docIDs).
• Indexer: parses and distributes hit lists into "barrels".
• Lexicon: in-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel which the wordID falls into.
• Barrels: partially sorted forward indexes, sorted by docID. Each barrel stores hitlists for a given range of wordIDs.
• Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.
• Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled. If not, it just contains the URL.
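
The URL Resolver's job, as described in the architecture notes, can be sketched with the standard library: relative links are made absolute, each absolute URL is assigned a docID, and every link becomes a pair of docIDs. The function name `resolve_links`, the dictionary used as a stand-in for the Doc Index, and the `example.com` URLs are assumptions for this example.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs, doc_ids):
    """Return (source docID, target docID) pairs for the links found on a page.

    doc_ids maps absolute URL -> docID and is extended as new URLs appear.
    """
    source = doc_ids.setdefault(page_url, len(doc_ids))
    pairs = []
    for href in hrefs:
        absolute = urljoin(page_url, href)                 # relative -> absolute
        target = doc_ids.setdefault(absolute, len(doc_ids))
        pairs.append((source, target))
    return pairs


doc_ids = {}
pairs = resolve_links(
    "http://example.com/a/b.html",
    ["../c.html", "d.html", "http://other.example/"],
    doc_ids,
)
print(doc_ids)
print(pairs)
```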

Infrastructure Part III

Google's Architecture

(from http://www.ics.uci.edu/~scott/google.htm)

• Barrels: there are 2 kinds of barrels. Short barrels contain hit lists which include title or anchor hits. Long barrels hold all hit lists.
• Lexicon: the list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the searcher. The lexicon stores ~14 million words.
• Searcher: the new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks are used to answer queries.
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Search Engine Optimizations - Section IV - Outline 


Outline 
• Significance of SEOs
• Elementary Ranking Schemes
• What Makes Ranking Optimization Hard?
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The Significance of SEO's 


• There are too many sites for humans to maintain a ranking.
• Humans are biased: they have different ideas of what "good" and "bad" are.
• With a search space as large as the web, optimizing the order of operations and the data structures has huge consequences.
• Concise and well-developed heuristics lead to more accurate and quicker results.
• Different methods and algorithms can be combined (see: RANK MERGING) to increase overall efficiency.
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Elementary SEO's for Ranking 


• Word Frequency Analysis within Pages
• Implicit Rating Systems: the search engine considers how many times a page has been visited or how long a user has remained on a site.
• Explicit Rating Systems: the search engine asks for your feedback after visiting a site.
• Most feedback systems have severe flaws (but can be useful if implemented correctly and used with other methods).
• More sophisticated: Weighted Heuristic Page Analysis, Rank Merging, and Manipulation Prevention Systems
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What Makes Ranking Optimization Hard? 


• Link Spamming
• Keyword Spamming
• Page hijacking and URL redirection
• Intentionally inaccurate or misleading anchor text
• Accurately targeting people's expectations
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PageRank - Section V - Outline 


Outline 


• Link Analysis and Anchors
• Introduction to PageRank
• Calculating Naive PR
• Example
• Calculating PR using Linear Algebra
• Implementing PageRank
• Problems with PR
• Improving PageRank
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Link Analysis and Anchors 


• Hypertextual links are convenient to users and represent physical citations on the Web.
• Anchor text analysis: <a href="http://www.google.com">Anchor Text</a>
  • Anchor text can be a more accurate description of the target site than the target site's own text.
  • Anchors can point at non-HTTP or non-text targets, such as images, videos, databases, PDF files, and PostScript files.
• Anchors also make it possible for non-crawled pages to be discovered.
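
As an illustration of anchor text analysis, the sketch below collects (href, anchor text) pairs from a page with Python's standard `html.parser` module. The class name `AnchorCollector` and the sample HTML are invented for this example; a production indexer would also need to cope with nested tags and malformed markup.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchors(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.anchors


page = ('<p>See <a href="http://www.google.com">Anchor Text</a> and '
        '<a href="/paper.pdf">the paper</a>.</p>')
anchors = collect_anchors(page)
print(anchors)
```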
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Introduction to PageRank 


• Rights belong to Google; the patent belongs to Stanford University.
• A top 10 IEEE ICDM data mining algorithm.
• An algorithm used to rank the relative importance of pages within a network.
• The PageRank idea is based on elements of democratic voting and citations.
• The PR algorithm uses logarithmic scaling; the total PR of a network is 1.
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Introduction to PageRank 


PageRank is a link analysis algorithm that ranks the relative importance of all web pages within a network. It does this by looking at three web page features:

1. Outgoing Links: the number of links found in a page
2. Incoming Links: the number of times other pages have cited this page
3. Rank: a value representing the page's relative importance in the network
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Calculating Naive PageRank 


The Page Rank Equation 


$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$


PR(A) = the PageRank of page A
T_1, ..., T_n = the pages that link to page A
C(A) or L(A) = the total number of outgoing links from page A


d = the damping factor. It induces randomness to prevent certain pages from gaining too much rank. The (1-d) term adds back the value lost by multiplying by the damping factor, to ensure that the sum over all web pages in the network is 1. The damping factor also enforces a random surfing model, which is comparable to a Markov chain.


Calculating Naive PageRank, Cont'd 


The PageRank of a page A, denoted PR(A), is decided by the quality and quantity of the sites linking to or citing it. Every page T_i that links to page A is essentially casting a vote, deeming page A important. By doing this, T_i propagates some of its PR to page A.

How can we determine how much rank an individual page T_i gives to A?

T_i may contain many links, not just a single link to page A.

T_i must propagate its PageRank equally to its citations. Thus, we only want to give page A a fraction of PR(T_i).

The amount of PR that T_i gives to A can be expressed as the damping value times PR(T_i) divided by the total number of outgoing links from T_i.
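
The propagation rule just described can be turned into a short iterative sketch. Two assumptions are made here and are not taken from the slides: the constant term is written as (1 - d)/N instead of (1 - d), so that the ranks really do sum to 1, and every page is assumed to have at least one outgoing link. The graph, the damping value 0.85, and the iteration count are illustrative choices.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the naive PageRank update.

    links maps each page to the list of pages it links to. Every page must
    appear as a key and must have at least one outgoing link (assumption).
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new = {page: (1.0 - d) / n for page in pages}   # normalized (1 - d) term
        for t, outgoing in links.items():
            share = d * pr[t] / len(outgoing)           # d * PR(T_i) / C(T_i)
            for a in outgoing:
                new[a] += share
        pr = new
    return pr


toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
print(ranks)
```

In this toy graph, page B receives only half of A's vote, so it ends up with the lowest rank.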
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Naive Example 


Computing PageRank (Step 0): Initialize so total rank sums to 1.0. Node ranks: 0.33, 0.33, 0.33.

Computing PageRank (Step 1): Propagate weights across out-edges. Edge weights: 0.33, 0.17, 0.33, 0.17.

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$

Computing PageRank (Step 2): Compute weights based on in-edges. Node ranks: 0.50, 0.33, 0.17.

Computing PageRank (Convergence): Node ranks: 0.40, 0.40, 0.20.

(figures from www.seas.upenn.edu/~zives/cis555/slides/G-WebSearch.ppt)
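
The numbers in this example can be reproduced with a few lines of code. The drawing of the graph did not survive, so the three-node link structure below is an assumption: it was inferred from the edge weights and node ranks shown (one page with two out-links, two pages with one each), and the page names are invented. No damping is applied, matching the naive propagation rule.

```python
def propagate(ranks, links):
    """One step: each page splits its rank equally across its out-edges."""
    new = {page: 0.0 for page in ranks}
    for page, outgoing in links.items():
        for target in outgoing:
            new[target] += ranks[page] / len(outgoing)
    return new


# Inferred (assumed) link structure for the three-node example.
example_links = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p2"]}

start = {page: 1.0 / 3 for page in example_links}   # Step 0
after_one = propagate(start, example_links)          # Steps 1 and 2

converged = start
for _ in range(200):                                 # iterate to convergence
    converged = propagate(converged, example_links)

print({p: round(v, 2) for p, v in after_one.items()})
print({p: round(v, 2) for p, v in converged.items()})
```

Under this assumed graph, one step gives ranks of 0.33, 0.50, and 0.17, and the iteration settles at 0.40, 0.40, and 0.20.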


Calculating PageRank using Linear Algebra 


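$$B = \begin{pmatrix}
0 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1/2 & 1/3 & 0 & 0 & 1 \\
0 & 1/3 & 1 & 0 & 0 \\
1/2 & 0 & 0 & 0 & 0
\end{pmatrix}$$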


Typically PageRank computation is done by finding the principal eigenvector of 
the Markov chain transition matrix. The vector is solved using the iterative power 
method. Above is a simple Naive PageRank setup which expresses the network as 
a link matrix. 
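
A minimal power-method sketch for the 5-by-5 link matrix B shown above follows. The matrix entries are read row by row from the slide (each column sums to 1); the tolerance, the iteration limit, and the uniform starting vector are assumptions of this sketch.

```python
# Link matrix B from the slide: entry [i][j] is the fraction of page j's
# rank that flows to page i.
B = [
    [0,   1/3, 0, 0, 0],
    [0,   0,   0, 1, 0],
    [1/2, 1/3, 0, 0, 1],
    [0,   1/3, 1, 0, 0],
    [1/2, 0,   0, 0, 0],
]


def power_method(matrix, tol=1e-12, max_iter=10_000):
    """Repeatedly apply the matrix to a vector until it stops changing."""
    n = len(matrix)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [value / total for value in nxt]          # keep the ranks summing to 1
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


principal = power_method(B)
print([round(value, 4) for value in principal])
```

Solving x = Bx by hand gives the vector (2/19, 6/19, 4/19, 6/19, 1/19), which is what the iteration approaches.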


• More examples can be found at:
  • http://www.ianrogers.net/google-page-rank/
  • http://www.webworkshop.net/pagerank.html
  • http://www.math.uwaterloo.ca/~hdesterc/websiteW/ [...] Data/presentations/pres2008/ChileApr2008.pdf

Calculating PageRank using Linear Algebra, Cont'd

[Figure: equation labeled "(dominant eigenvector)"]

For those interested in the actual PageRank calculation and implementation process (involving heavier linear algebra), I invite you to view my "Additional Resources" slide.

Implementing PageRank

• Convert all page URLs into unique integers (for IDs).
• Sort the link structure by parent ID.
• Remove dangling links from the link database.
• Of several possible strategies, they chose to iterate until convergence.
• Assign initial values to nodes (based on assumed importance).
• Run the model until it converges. To ensure accuracy, repeat the process with correctly initialized weights.
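
The first three preparation steps can be sketched as follows. The function name `prepare_links`, the in-memory data layout, and the `.example` URLs are assumptions; the sketch also removes dangling links in a single pass, whereas removing them can expose new dangling pages, so a real pipeline might repeat the pass.

```python
def prepare_links(links):
    """Prepare a link list for PageRank iteration.

    links: iterable of (source URL, target URL) pairs.
    Returns (ids, edges): a URL -> integer ID map and the ID pairs, sorted by
    parent ID, with links into pages that have no outgoing links removed.
    """
    ids = {}

    def url_id(url):
        return ids.setdefault(url, len(ids))      # unique integer per URL

    edges = sorted((url_id(src), url_id(dst)) for src, dst in links)
    parents = {src for src, _ in edges}           # pages with outgoing links
    kept = [(src, dst) for src, dst in edges if dst in parents]
    return ids, kept


raw_links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),   # c has no out-links: dangling
]
ids, edges = prepare_links(raw_links)
print(ids)
print(edges)
```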
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Disadvantages and Problems 


• Rank Sinks: occur when pages get caught in infinite link cycles.
• Spider Traps: a group of pages is a spider trap if there are no links from within the group to outside the group.
• Dangling Links: a page contains a dangling link if the hypertext points to a page with no outgoing links.
• Dead Ends: simply pages with no outgoing links.

Solution to all of the above: by introducing a damping factor, the figurative random surfer stops trying to traverse the sunk page(s) and will either follow a link randomly or teleport to a random node in the network.
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Improving PageRank 


• According to "The PageRank Citation Ranking: Bringing Order to the Web", the choice of initial values for pages in a network does not affect the final values, only the rate of convergence.
• By choosing initial values wisely, the algorithm converges more quickly.
• Local Graph Partitioning using PageRank Vectors (complicated): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04031383
• Distributed PR computation based on Iterative Aggregation-Disaggregation (IAD) methods.
• Alternate infrastructure to avoid costly data seeks (lookups).




Competition and Future Research - Section VI - Outline 


Outline 


• Jon Kleinberg's HITS
• Lempel and Moran's SALSA
• Ask.com's ExpertRank
• Microsoft's BrowseRank
• Yahoo!'s TrustRank (also from Stanford!)

Jon Kleinberg's HITS Algorithm

Hyperlink-Induced Topic Search (HITS), also known as Hubs and Authorities, is a link analysis algorithm developed by Jon Kleinberg that rates web pages. It determines two values for a page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

It is processed on a small subset of "relevant" documents, not on all documents as is the case with PageRank.
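
A compact sketch of the hub and authority updates is given below. It assumes the small set of relevant pages has already been selected, and the graph, the page names, the iteration count, and the use of Euclidean normalization are illustrative choices rather than details from the slides.

```python
def hits(links, iterations=50):
    """Compute hub and authority values for a small link graph.

    links maps a page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of the hub values of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority values of the pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {page: v / a_norm for page, v in auth.items()}
        hub = {page: v / h_norm for page, v in hub.items()}
    return hub, auth


toy_graph = {"hub1": ["a", "b"], "hub2": ["a"]}
hub_scores, auth_scores = hits(toy_graph)
print(hub_scores)
print(auth_scores)
```

In the toy graph, page "a" is cited by both hubs and so receives the highest authority value, while "hub1" links to more authorities and so receives the higher hub value.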
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Teoma aka Ask.com’s ExpertRank 


Teoma was unique because of its link popularity algorithm. Unlike Google's PageRank, Teoma's technology (Subject-Specific Popularity) analyzed links in context to rank a web page's importance within its specific subject. For instance, a web page about "baseball" would rank higher if other web pages about "baseball" link to it.

Lempel and Moran's SALSA

SALSA is a stochastic approach to link-structure analysis that examines random walks on graphs derived from the link structure. Read more at http://portal.acm.org/citation.cfm?id=383041
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Microsoft's BrowseRank 


[Figure: user behavior data (per-user records of URL, time, and whether the page was reached by typed input or by a click) is converted into a user browsing graph.]


BrowseRank is based on the time a user spends on a site. 


[Figure: plot on logarithmic axes; x-axis: Observed Staying Time (seconds)]

Table 3: Top 20 websites by three different algorithms (first 11 rows shown)

No.  PageRank          TrustRank        BrowseRank
1    adobe.com         adobe.com        myspace.com
2    passport.com      yahoo.com        msn.com
3    msn.com           google.com       yahoo.com
4    microsoft.com     msn.com          youtube.com
5    yahoo.com         microsoft.com    live.com
6    google.com        passport.net     facebook.com
7    mapquest.com      ufindus.com      google.com
8    miibeian.gov.cn   sourceforge.net  ebay.com
9    w3.org            myspace.com      hi5.com
10   godaddy.com       wikipedia.org    bebo.com
11   statcounter.com   phpbb.com        orkut.com
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Yahoo!'s TrustRank (also from Stanford!) 


TrustRank method calls for selecting a small set of seed pages to be evaluated by 
an expert. Once the reputable seed pages are manually identified, a crawl 
extending outward from the seed set seeks out similarly reliable and trustworthy 
pages. TrustRank’s reliability diminishes as documents become further removed 
from the seed set. 




Conclusion - Section VII - Outline 


• Experimental Results (Benchmarking)
• The Future of PageRank and Other Applications
• Exam Questions
• Bibliography
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Benchmarking Convergence 


[Figure: Convergence of PageRank Computation. Total difference from previous iteration (log scale, 10 to 100,000,000) versus number of iterations (0 to 52.5), for link sets of 322 million and 161 million links.]

• Convergence of the power method is FAST! 322 million links converge almost as quickly as 161 million.
• Doubling the size has very little effect on the convergence time.
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Experimental Results 


Storage Requirements 


(from http://www.ics.uci.edu/~scott/google.htm)

At the time of publication, Google had the following statistical breakdown for storage requirements:

Storage Statistics (values in GB)

Component              Size (GB)  Share
Links Database         3.9        4%
Document Index         9.7        9%
Lexicon                0.293      0%
Compressed Repository  53.5       49%
Inverted Index         41.3       38%

• Data structures are obviously highly optimized for space.
• Infrastructure is set up for high parallelization.
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Benchmarking of Advanced PageRank Implementations 


[Figure: two panels, "uk iteration convergence" and "uk time convergence", plotting error (log scale) against (a) iteration count, 0 to 80, and (b) time in seconds, 0 to 40. Panel captions: (a) Convergence Iterations, (b) Convergence Time.]

• Parallel PageRank with PageRank iterations versus Jacobi, GMRES, BiCG, and BiCGSTAB: http://www.stanford.edu/~dgleich/projects/ppagerank/index.html
• Compares convergence and time for different implementations: http://www.stanford.edu/~dgleich/publications/gleich-zhukov-sc-pprank-abstract.pdf
• http://www.cs.cmu.edu/~yangboz/cikm05_pagerank.ppt
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Future of PR 


• A version of PageRank has recently been proposed as a replacement for the traditional Institute for Scientific Information (ISI) impact factor,[13] and implemented at eigenfactor.org. Instead of merely counting the total citations to a journal, the "importance" of each citation is determined in a PageRank fashion.

• Because PageRank roughly corresponds to a random web surfer model, PR can be used to estimate web traffic.
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Final Exam Questions 


• (1) Please state the PageRank formula and describe its components.


The Page Rank Equation


$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right) \tag{2}$$


PR(A) = The PageRank of page A 


C(A) or L(A) = the total number of outgoing links from page A 


d = The damping factor. 

• (2) Given a network of five node:rank pairs [a:.2, b:.3, c:.1, d:.3, e:.1], explain how to generate the corresponding Naive PR link matrix. Assume pages can link to themselves (there are no zeros along the diagonal).


The answer will be a 5 by 5 matrix where the labels for the rows and columns are the pages in the network. The first element of every row belongs to the first page, and so on, up to the fifth and final element in every row, which belongs to the fifth page. Likewise, the first element in every column belongs to the first page and the fifth element of every column belongs to the fifth page. Each item in the matrix is thus associated with two pages: its row represents one page and its column represents another. In this model, the value within any entry represents the fraction of rank the column page provides to the row page. To fill in any particular matrix entry, find the node in the graph associated with the entry's column and count its number of outgoing links. The entry's value is the column page's rank over this outgoing link count.
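
The construction described in this answer can be sketched as follows. The question gives only the ranks, so the link structure below is hypothetical, e's rank is taken to be 0.1 so that the ranks sum to 1, and entries are set to zero where the column page does not link to the row page (an interpretation of "the fraction of rank the column page provides").

```python
def link_matrix(ranks, links):
    """Build the matrix whose [row][col] entry is rank(col) / outlinks(col)
    when the column page links to the row page, and 0 otherwise."""
    pages = list(ranks)
    return [
        [ranks[col] / len(links[col]) if row in links[col] else 0.0
         for col in pages]
        for row in pages
    ]


ranks = {"a": 0.2, "b": 0.3, "c": 0.1, "d": 0.3, "e": 0.1}
# Hypothetical links; every page links to itself, as the question allows.
links = {
    "a": ["a", "b"],
    "b": ["b", "c", "d"],
    "c": ["c", "a"],
    "d": ["d", "e"],
    "e": ["e", "a"],
}
matrix = link_matrix(ranks, links)
for row in matrix:
    print([round(value, 3) for value in row])
```

Each column adds up to that page's rank, and each row sum is the rank the row page would receive in the next naive iteration.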
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Final Exam Questions 


• (3) Explain the significance of the damping factor. What values does it lie between? What problems does it solve? What factors does it account for?


The damping factor induces randomness to prevent certain pages from gaining too much rank. The value (1-d) from the PR equation ensures that the values lost by multiplying by the damping factor are added back to the system (such that the network maintains a cumulative PR of 1). The damping factor also enforces a random surfing model, which is comparable to a Markov chain. The damping factor handles dead ends and infinite loops by offering a probability of a random teleport or link traversal.
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Additional Resources 


• http://cis.poly.edu/suel/papers/pagerank.pdf - PR via the SplitAccumulate algorithm, merge-sort, etc.
• http://nlp.stanford.edu/~manning/papers/PowerExtrapolation.pdf - PR via Power Extrapolation; includes benchmarking
• http://www.webworkshop.net/pagerank_calculator.php - neat little tool for PR calculation with a matrix
• http://www.miislita.com/information-retrieval-tutorial/ [...] matrix-tutorial-3-eigenvalues-eigenvectors.html